Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2.12 - Performing Joins (clean one)")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
pets = spark.createDataFrame(
[
(1, 1, 'Bear'),
(2, 1, 'Chewie'),
(3, 2, 'Roger'),
], ['id', 'breed_id', 'nickname']
)
pets.toPandas()
  | id | breed_id | nickname
---|---|---|---
0 | 1 | 1 | Bear
1 | 2 | 1 | Chewie
2 | 3 | 2 | Roger
breeds = spark.createDataFrame(
[
(1, 'Pitbull', 10),
(2, 'Corgie', 20),
], ['id', 'name', 'average_height']
)
breeds.toPandas()
  | id | name | average_height
---|---|---|---
0 | 1 | Pitbull | 10
1 | 2 | Corgie | 20
Performing Joins
There are typically two types of joins in SQL:

Inner Join
is where 2 tables are joined on the basis of the common columns mentioned in the ON clause, i.e.
left.join(right, left[lkey] == right[rkey])

Natural Join
is where 2 tables are joined on the basis of all common columns, i.e.
left.join(right, 'key')
source: https://stackoverflow.com/a/8696402
Question: Which is better? Is it just a style choice?
Option 1: Inner Join (w/Different Keys)
join_condition = pets['breed_id'] == breeds['id']
df = pets.join(breeds, join_condition)
df.toPandas()
  | id | breed_id | nickname | id | name | average_height
---|---|---|---|---|---|---
0 | 1 | 1 | Bear | 1 | Pitbull | 10
1 | 2 | 1 | Chewie | 1 | Pitbull | 10
2 | 3 | 2 | Roger | 2 | Corgie | 20
What Happened:
- We have 2 columns named `id`, but they refer to different things.
- We can't easily reference either `id` column unambiguously (it is still possible, just awkward).
- The join expression is pretty long.

This is not ideal. Let's try renaming the key before the join.
Option 2: Inner Join (w/Same Keys)
breeds = breeds.withColumnRenamed('id', 'breed_id')
join_condition = pets['breed_id'] == breeds['breed_id']
df = pets.join(breeds, join_condition)
df.toPandas()
  | id | breed_id | nickname | breed_id | name | average_height
---|---|---|---|---|---|---
0 | 1 | 1 | Bear | 1 | Pitbull | 10
1 | 2 | 1 | Chewie | 1 | Pitbull | 10
2 | 3 | 2 | Roger | 2 | Corgie | 20
What Happened:
- We have 2 columns named `breed_id`, which mean the same thing!
- Duplicate columns appear in the result.
- The join expression is still pretty long.

This is again not ideal.
Option 3: Natural Join
df = pets.join(breeds, 'breed_id')
df.toPandas()
  | breed_id | id | nickname | name | average_height
---|---|---|---|---|---
0 | 1 | 1 | Bear | Pitbull | 10
1 | 1 | 2 | Chewie | Pitbull | 10
2 | 2 | 3 | Roger | Corgie | 20
What Happened:
- No duplicated columns!
- No extra columns!
- The join expression is a single string (or a list of column names, when joining on multiple keys).
Summary
Performing a natural join was the most elegant solution, both in terms of the join expression and the resulting df.

NOTE: These rules also apply to the other join types (i.e. left and right joins).

Some might argue that you need both join keys in the result for further transformations, such as filtering on only the left or right key, but I would recommend doing such filtering before the join, as this is more performant.